This work introduces XDom, a novel cross-domain attentional architecture for robust real-world wireless radio frequency (RF) fingerprinting. To the best of our knowledge, this is the first time such a comprehensive attention mechanism has been applied to the RF fingerprinting problem. We resort to real-world IoT WiFi and Bluetooth (BT) emissions (rather than synthetically generated waveforms) captured on an indoor experimental testbed with rich multipath and unavoidable interference. We show the impact of the capture timeframe by including waveforms collected over several months, and demonstrate both same-timeframe and multi-timeframe fingerprinting evaluations. Through single-task and multi-task model analyses, we experimentally demonstrate the effectiveness of resorting to a multi-task architecture. Finally, we demonstrate the significant performance gains of the proposed XDom architecture by benchmarking against well-known state-of-the-art fingerprinting models. Specifically, we report performance improvements of up to 59.3% and 4.91x under single-task WiFi and BT fingerprinting, respectively, and up to 50.5% fingerprinting accuracy improvement under the multi-task setting.
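The abstract does not spell out the architecture, so the following is a rough illustration only: a minimal multi-task sketch in PyTorch in which a shared self-attention encoder feeds one classification head per task. The layer sizes, encoder choice, and head structure are assumptions standing in for the paper's actual XDom design.

```python
# Minimal multi-task RF-fingerprinting sketch (illustrative only; the real
# XDom cross-domain attention design is not specified in this abstract).
import torch
import torch.nn as nn

class MultiTaskFingerprinter(nn.Module):
    def __init__(self, n_wifi_devices, n_bt_devices, d_model=64):
        super().__init__()
        # Embed raw I/Q samples (2 channels) into d_model-dim tokens.
        self.embed = nn.Conv1d(2, d_model, kernel_size=8, stride=4)
        # A shared self-attention encoder stands in for the paper's
        # cross-domain attention mechanism.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One head per task: WiFi and BT device identification.
        self.wifi_head = nn.Linear(d_model, n_wifi_devices)
        self.bt_head = nn.Linear(d_model, n_bt_devices)

    def forward(self, iq):                        # iq: (batch, 2, n_samples)
        tokens = self.embed(iq).transpose(1, 2)   # (batch, seq, d_model)
        pooled = self.encoder(tokens).mean(dim=1) # mean-pool over time
        return self.wifi_head(pooled), self.bt_head(pooled)

model = MultiTaskFingerprinter(n_wifi_devices=10, n_bt_devices=10)
wifi_logits, bt_logits = model(torch.randn(4, 2, 1024))
```

In such a setup, the multi-task benefit reported above would come from the shared encoder being trained against the sum of both task losses.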
We prove that, for a sufficiently small universal constant $c>0$, a random set of $c\, d^2/\log^4(d)$ independent Gaussian random points in $\mathbb{R}^d$ lies on a common ellipsoid with high probability. This nearly establishes a conjecture of~\cite{SaundersonCPW12}, within logarithmic factors. The latter conjecture has attracted significant attention over the past decade, due to its connections to machine learning and sum-of-squares lower bounds for certain statistical problems.
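For context, the underlying ellipsoid fitting question is a semidefinite feasibility problem; the following is its standard formulation (from the literature, not quoted from this abstract):

```latex
% Points x_1, ..., x_m ~ N(0, I_d) lie on a common ellipsoid exactly when
% the following SDP in the matrix variable A is feasible:
\exists\, A \succeq 0 \ \text{ such that } \
x_i^\top A\, x_i = 1 \quad \text{for all } i = 1, \dots, m .
```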
Automated offensive language detection is essential in combating the spread of hate speech, particularly in social media. This paper describes our work on offensive language identification in the low-resource Indic language Marathi. The problem is formulated as a text classification task: identify a tweet as offensive or non-offensive. We evaluate different monolingual and multilingual BERT models on this classification task, focusing on BERT models pre-trained on social media datasets. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore external data augmentation from other existing Marathi hate speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all other models with an F1 score of 98.43 on the HASOC 2022 test set when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate). With this, we also provide a new state-of-the-art result on the HASOC 2022 / MOLD v2 test set.
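A minimal sketch of the fine-tuning setup described above, using the HuggingFace Trainer. The model ID is assumed to be the L3Cube release of MahaTweetBERT, and the hyperparameters and the one-row toy dataset are placeholders, not the paper's actual configuration or data.

```python
# Hedged fine-tuning sketch; model ID and hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "l3cube-pune/marathi-tweets-bert"  # assumed HF ID for MahaTweetBERT
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=2)

# Stand-in for the combined HASOC 2021 + HASOC 2022 + MahaHate training data;
# labels: 0 = non-offensive, 1 = offensive.
train_ds = Dataset.from_dict({"text": ["उदाहरण ट्वीट"], "label": [0]})
train_ds = train_ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                            padding="max_length"),
                        batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```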
The Forster transform is a method of regularizing a dataset by placing it in {\em radial isotropic position} while maintaining some of its essential properties. Forster transforms have played a key role in a diverse range of settings spanning computer science and functional analysis. Prior work had given {\em weakly} polynomial time algorithms for computing Forster transforms, when they exist. Our main result is the first {\em strongly polynomial time} algorithm to compute an approximate Forster transform of a given dataset or certify that no such transformation exists. By leveraging our strongly polynomial Forster algorithm, we obtain the first strongly polynomial time algorithm for {\em distribution-free} PAC learning of halfspaces. This learning result is surprising because {\em proper} PAC learning of halfspaces is {\em equivalent} to linear programming. Our learning approach extends to give a strongly polynomial halfspace learner in the presence of random classification noise and, more generally, Massart noise.
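Concretely, radial isotropic position has the following standard characterization (a definition from the literature, not quoted from the abstract): an invertible matrix $A$ is a Forster transform for $x_1, \dots, x_n \in \mathbb{R}^d$ if the radial projections of the transformed points are in isotropic position:

```latex
% A is an (exact) Forster transform for x_1, ..., x_n if
\sum_{i=1}^{n} \frac{(A x_i)(A x_i)^\top}{\lVert A x_i \rVert_2^2}
  \;=\; \frac{n}{d}\, I_d ,
% i.e., the unit vectors A x_i / ||A x_i|| are in isotropic position.
```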
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
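As a hedged restatement of the target guarantee (the exact rate and moment conditions below are our reading of the standard setting, not a quote from the paper): for an $\epsilon$-corrupted sample from a distribution with $k$-sparse mean and bounded-covariance-type moments, the estimator $\hat{\mu}$ should satisfy something of the form

```latex
% Informal target guarantee under bounded-covariance-type assumptions:
\Pr\!\big[\, \lVert \hat{\mu} - \mu \rVert_2 \le O(\sqrt{\epsilon}) \,\big]
  \;\ge\; 1 - \tau ,
% with a sample complexity scaling as poly(k, 1/eps) * log(d), plus an
% additive O(log(1/tau)/eps) term, matching the dependence stated above.
```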
Pre-training large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Although this method has proven to be effective for many domains, it might not always provide desirable benefits. In this paper, we study the effects of hateful pre-training on low-resource hate speech classification tasks. While previous studies on the English language have emphasized its importance, we aim to augment their observations with some non-obvious insights. We evaluate different variations of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed subsets of a 40M tweet dataset. This evaluation is carried out for the Indian languages Hindi and Marathi. This paper provides empirical evidence that hateful pre-training is not the best pre-training option for hate speech detection. We show that pre-training on non-hateful text from the target domain provides similar or better results. Further, we introduce HindTweetBERT and MahaTweetBERT, the first publicly available BERT models pre-trained on Hindi and Marathi tweets, respectively. We show that they provide state-of-the-art performance on hate speech classification tasks. We also release the hateful BERT models for the two languages, along with gold hate speech evaluation benchmarks HateEval-Hi and HateEval-Mr, each consisting of 2,000 manually labeled tweets. The models and data are available at https://github.com/l3cube-pune/MarathiNLP .
Visual Question Answering (VQA) is a multimodal task that involves answering questions about an input image, semantically understanding the contents of the image and responding in natural language. Given the range of questions a VQA system can answer, using VQA for disaster management is an important line of research. However, the main challenge is the delay incurred in generating labels for assessing the affected areas. To address this, we deploy a pre-trained CLIP model, which is trained on image-text pairs. However, we empirically observe that the model has poor zero-shot performance. We therefore instead use the pre-trained text and image embeddings from this model for our supervised training, and surpass the previous state-of-the-art results on the FloodNet dataset. We extend this to a continual setting, which is a more realistic scenario. We tackle the problem of catastrophic forgetting using various experience replay methods. Our training runs are available at: https://wandb.ai/compyle/continual_vqa_final
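A minimal sketch of the pipeline described above, assuming OpenAI's CLIP ViT-B/32 release: frozen CLIP embeddings feed a small supervised answer-classification head, and a naive replay buffer stands in for the experience replay methods the paper evaluates. The head sizes, answer vocabulary, and buffer policy are all assumptions.

```python
# Frozen CLIP features + supervised head + naive experience replay (sketch).
import random
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(image, question):
    """Frozen CLIP image and text embeddings, concatenated (512 + 512)."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(clip.tokenize([question]).to(device))
    return torch.cat([img, txt], dim=-1).float()

head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                     nn.Linear(256, 10)).to(device)  # 10 = assumed answers
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
replay_buffer, loss_fn = [], nn.CrossEntropyLoss()

def train_step(feat, label):
    # Mix the current example with a few replayed ones (experience replay).
    batch = [(feat, label)] + random.sample(replay_buffer,
                                            min(8, len(replay_buffer)))
    x = torch.cat([f for f, _ in batch])
    y = torch.tensor([l for _, l in batch], device=device)
    opt.zero_grad()
    loss_fn(head(x), y).backward()
    opt.step()
    replay_buffer.append((feat, label))
```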
Gender bias present in the data on which language models are trained gets reflected in the systems that use these models. The intrinsic gender bias in these models shows an outdated and unequal view of women in our culture and encourages discrimination. Therefore, in order to build more equitable systems and increase fairness, it is crucial to identify and mitigate the bias present in these models. While there is a significant amount of work in this area in English, there is a lack of research in other gendered and low-resource languages, particularly Indian languages. English is a non-gendered language with genderless nouns, so methodologies for bias detection in English cannot be directly deployed in other gendered languages, where the syntax and semantics differ. In our paper, we measure gender bias associated with occupations in Hindi language models. Our major contributions are the construction of a novel corpus to evaluate occupational gender bias in Hindi, quantifying the bias present in these systems using a well-defined metric, and mitigating it by efficiently fine-tuning our model. Our results reflect that the bias is reduced after the introduction of our proposed mitigation techniques. Our codebase is publicly available.
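The paper's corpus and metric are not given here, so the following is only an illustrative probe for occupational gender bias via masked-language modelling; the choice of MuRIL as the Hindi-capable model, the template convention, and the candidate words are all assumptions for the sketch.

```python
# Illustrative occupational-gender-bias probe; model ID, template, and
# candidate words are assumptions, not the paper's actual corpus/metric.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "google/muril-base-cased"  # an assumed Hindi-capable masked LM
tok = AutoTokenizer.from_pretrained(model_id)
mlm = AutoModelForMaskedLM.from_pretrained(model_id)

def mask_fill_probs(template, candidates):
    """MLM probability of each single-token candidate at the [MASK] slot."""
    inputs = tok(template, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        probs = mlm(**inputs).logits[0, mask_pos].softmax(-1)
    # Candidates must tokenize to a single piece for this simple probe.
    return {c: probs[tok.convert_tokens_to_ids(c)].item() for c in candidates}

# Compare the mass assigned to masculine vs. feminine gendered forms in an
# occupation template; a consistently large gap suggests occupational bias.
```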
We study the complexity of PAC learning halfspaces in the presence of Massart noise. In this problem, we are given i.i.d. labeled examples $(\mathbf{x}, y) \in \mathbb{R}^n \times \{\pm 1\}$, where the distribution of $\mathbf{x}$ is arbitrary and the label $y$ is a Massart corruption of $f(\mathbf{x})$, for an unknown halfspace $f: \mathbb{R}^n \to \{\pm 1\}$, with flip probability $\eta(\mathbf{x}) \leq \eta < 1/2$. The goal of the learner is to compute a hypothesis with small 0-1 error. Our main result is the first computational hardness result for this learning problem. Specifically, assuming the (widely believed) subexponential-time hardness of the Learning with Errors (LWE) problem, we show that no polynomial-time Massart halfspace learner can achieve error better than $\Omega(\eta)$, even when the optimal 0-1 error is as small as $\mathrm{OPT} = 2^{-\log^{c}(n)}$ for any universal constant $c \in (0, 1)$. Prior work had provided qualitatively similar evidence of hardness in the statistical query model. Our computational hardness result essentially resolves the polynomial PAC learnability of Massart halfspaces, showing that the known efficient learning algorithms for the problem are nearly best possible.
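For completeness, the Massart noise model referenced above can be written out explicitly (a standard definition, consistent with the notation in the abstract):

```latex
% Massart (bounded) noise: each example x has its true label flipped
% independently with an instance-dependent probability eta(x):
y \;=\; \begin{cases}
  \phantom{-}f(\mathbf{x}) & \text{with probability } 1 - \eta(\mathbf{x}),\\
  -f(\mathbf{x})           & \text{with probability } \eta(\mathbf{x}),
\end{cases}
\qquad \eta(\mathbf{x}) \,\le\, \eta \,<\, \tfrac{1}{2} .
```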
Machine learning (ML) research has generally focused on models, while the most prominent datasets have been employed for everyday ML tasks without regard for the breadth, difficulty, and faithfulness of these datasets to the underlying problem. Neglecting the fundamental importance of datasets has caused significant problems, involving data cascades in real-world applications and saturation of dataset-driven criteria for model quality, and has hindered the growth of research. To address this issue, we present DataPerf, a benchmark package for evaluating ML datasets and the algorithms that work on them. We intend to enable the "data ratchet", in which training sets help to evaluate test sets on the same problems, and vice versa. Such a feedback-driven strategy will generate a virtuous cycle that accelerates data-centric AI. The MLCommons Association will maintain DataPerf.